Background

  • Computing is a tool to achieve an objective, not an end in itself
    • Based on experience of PhD in Quantum Physics

Problem

  • Using machine learning to understand brain function
    • Functional MRI; records brain activity over time and space
  • Learn link between brain activity and cognitive function
    • Feature-engineer the input so that it maps accurately to brain function.
    • This might inform how the brain is actually doing the mapping.

Prior Art

  • Visual image reconstruction from brain activity, Miyawaki et al., 2008
    • Very impressive, but not reproducible.
    • Science needs to be reproducible.
  • Make it work, make it right, make it boring.
  • Want to robustly and reliably reproduce scientific results, make them boring

What is a good theory

  1. Accurately describe a large class of observations (training)
  2. Make definite predictions about future observations (testing)
  • This is machine learning!
  • Not just minimising error; data-driven science uses data to derive better models, not just to build classifiers.

Problems with software development in labs

  • Labs are like startups
    • Recruiting talent, keeping them.
    • Limited resources.
    • "Bus factor". How many people can be hit by a bus before your project stops.
  • You really need to engineer software well to survive yourself moving on for whatever reason.
  • Technical debt
    • You need to do the maintenance, documentation, testing, or else your project will inevitably die off.

Patterns in data processing

  1. Interact with data manually.
  2. Automate the interaction.
  3. Go to 1.
  • Iteration goes with consolidation.
    • As you iterate you reduce technical debt and get closer to the goal.
  • Academia is moving from statistics to statistical learning (formal machine learning)
    • Mainly due to dimensionality of feature sets.
  • From parameter inference to prediction.

Design philosophy

  1. Don't solve hard problems; bend the original problem.
    • Judo technique.
  2. Easy setup.
    • Think about installation steps, dependencies, convention over configuration.
  3. Fail gracefully.
    • Robust
    • Easy to debug (major, key success point of Python. Much easier to debug than C).
  4. Quality.
  5. Don't invent a kitchen sink.
    • Keep the focus as narrow as possible.
    • This increases the bus factor.
    • As you need features, create new projects and link them.

scikit-learn

  • Presenter is a core contributor.
  • Vision: machine learning without knowing the math.
    • A black box, but one that can be opened.
  • Apple vs Linux.
    • Older geeks tend to use Apple products. Things should just work.
  • This module can't magically solve feature engineering for you.
    • But Python is the perfect language to solve this by yourself.
  • Sticking to high-level programming keeps scikit-learn alive.
    • But how do you stay performant at this high level?
    • Optimise algorithms, not low-level stuff.
    • Know NumPy and SciPy perfectly.
    • All data must be arrays/memoryviews. Avoid memory copies, defer to BLAS/LAPACK (illustrated after this list).
    • Cython.
    • scikit-learn actively avoids C/C++.
      • Increases bus factor.
      • New contributors always complain, but this philosophy works.
  • http://scipy-lectures.github.io
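
  A minimal illustration of the array point above (not scikit-learn code): keep data in NumPy arrays, avoid copies with in-place operations, and let the heavy lifting defer to BLAS through NumPy. Shapes are arbitrary.

    import numpy as np

    X = np.random.rand(1000, 200)
    W = np.random.rand(200, 50)

    # BLAS-backed matrix multiply: no Python-level loop, and no extra copies
    # beyond the output buffer
    Y = X @ W

    # in-place scaling avoids allocating a second array the size of X
    X *= 2.0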

Hierarchical clustering

  • Pull request 2199
  • How (sketched after this list):
    1. Take the two closest clusters.
    2. Merge them.
    3. Update the distance matrix.
  • First approach:
    • How to find the minimum? Heaps!
    • Sparse growable structures? Skip lists in Cython!
  • Second approach:
    • A C++ map[int, float] is exactly what's needed, so wrap it in Cython!
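
  A minimal, pure-Python sketch of the merge loop above, using heapq for the "closest pair" step and plain average linkage on a dense distance matrix. Purely illustrative; the pull request itself relies on Cython data structures (skip lists, a wrapped C++ map), not this naive version.

    import heapq
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def naive_average_linkage(X):
        n = X.shape[0]
        d = squareform(pdist(X))             # dense pairwise distance matrix
        active = {i: [i] for i in range(n)}  # cluster id -> member point indices
        heap = [(d[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
        heapq.heapify(heap)
        merges, next_id = [], n
        while len(active) > 1:
            dij, i, j = heapq.heappop(heap)  # 1. take the two closest clusters
            if i not in active or j not in active:
                continue                     # stale pair: one side already merged
            members = active.pop(i) + active.pop(j)   # 2. merge them
            merges.append((i, j, dij))
            # 3. update distances from the new cluster to every remaining one
            for k, mem_k in active.items():
                dk = np.mean([d[a, b] for a in members for b in mem_k])
                heapq.heappush(heap, (dk, k, next_id))
            active[next_id] = members
            next_id += 1
        return merges

    merges = naive_average_linkage(np.random.rand(10, 2))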

Data vs operations

  • Conceptually, have a big blob of data and operations are agents that walk over the data.
  • Want an imperative-like language.
    • Declarative programming is great in theory but doesn't work in practice.
  • Core grammar
    • fit, predict, transform, score, partial_fit
  • Grammar instantiated without data.
  • Build pipelines around the grammar without data (see the sketch after this list).
    • Configuration/run pattern, a la traits, pyre.
    • This is just a convention, very light. You can ignore it if you want, but if you submit a pull request ignoring it you'll get rejected.
  • a la currying in functional programming.
  • a la MVC pattern.
  • APIs are important and are informed by prior art and heuristics, despite how simple they seem.
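
  A minimal sketch of the configure-then-run pattern the grammar implies: estimators are instantiated with parameters only, composed into a pipeline with no data involved, and the data only flows in at fit/score time. The estimators and toy data below are arbitrary choices for illustration.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # configuration step: no data anywhere
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(C=1.0)),
    ])

    # run step: data flows through the grammar (fit, then score)
    X = np.random.rand(100, 5)
    y = (X[:, 0] > 0.5).astype(int)
    pipe.fit(X, y)
    print(pipe.score(X, y))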

Big data on small hardware

  • Can't afford Hadoop, and want to use Python end-to-end.
  • Off the shelf commodity hardware (laptops!)
  • One trick: online algorithms (see the sketch after this list)
    • Compute something one element at a time.
    • e.g. mean of a gazillion numbers? Just keep a running mean.
    • Use algorithms that statistically converge to the true value with some estimable error.
  • e.g. K-Means clustering.
    • scipy.cluster.vq.kmeans is precise, slow
    • sklearn.cluster.MiniBatchKMeans is statistical, much faster.
  • People complain "I need a cluster to add petabytes of arrays"
    • Why?? Use online algorithms.
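
  A minimal sketch of the online idea: a running mean computed one chunk at a time, and MiniBatchKMeans consuming the same stream via partial_fit. The stream generator and its sizes are made up for illustration.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def stream_chunks(n_chunks=100, chunk_size=1000, n_features=10, seed=0):
        rng = np.random.RandomState(seed)
        for _ in range(n_chunks):
            yield rng.rand(chunk_size, n_features)  # stand-in for data read from disk

    total, count = 0.0, 0
    km = MiniBatchKMeans(n_clusters=5, random_state=0)
    for chunk in stream_chunks():               # never holds more than one chunk
        total += chunk.sum()
        count += chunk.size
        km.partial_fit(chunk)                   # online K-Means update

    print("running mean:", total / count)
    print("cluster centers:", km.cluster_centers_.shape)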

Data reductions

  • Remember memory is hierarchical. Reducing data sets allows more to fit in higher levels of the hierarchy.
  • Take a random subset (reductions sketched after this list)
    • Random projection: sklearn.random_projection (random combinations of features)
    • e.g. Randomized SVD, sklearn.utils.extmath.randomized_svd
      • Their randomized solution is more accurate than other supposedly precise solutions
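
  A minimal sketch of the two reductions above: project the features down with a random projection, and compute a truncated SVD with the randomized solver. The array shapes and numbers of components are arbitrary.

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection
    from sklearn.utils.extmath import randomized_svd

    X = np.random.rand(2000, 500)

    # random linear combinations of the 500 features, down to 50
    proj = GaussianRandomProjection(n_components=50, random_state=0)
    X_small = proj.fit_transform(X)

    # truncated SVD via random projections
    U, S, Vt = randomized_svd(X, n_components=10, random_state=0)
    print(X_small.shape, U.shape, S.shape, Vt.shape)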

Their box

  • 48 cores, 384GB RAM, 70T storage (SSD cache on RAID controller)
  • Faster than an 800 CPU cluster!
  • Do you really need a cluster? Think about data access patterns.

Parallel processing

  • Only want to care about embarrassingly parallel problems
  • Data access / memory bus is going to be the bottleneck.
  • joblib (see the sketch after this list).
    • OpenMP-style. Why joblib rather than e.g. IPython, multiprocessing, celery?
    • No dependencies.
    • Better tracebacks.
    • Automatic mmap'ing of big arrays, no copies.
    • Lazy dispatching, important for big jobs.
    • With random forests, near-100% multi-core CPU utilisation with low memory usage.
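
  A minimal sketch of the OpenMP-style joblib API referred to above: the parallelism is declared inline and joblib manages the worker processes. The toy workload is arbitrary.

    from math import sqrt
    from joblib import Parallel, delayed

    # run sqrt on 10 inputs across 2 worker processes
    results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
    print(results)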

Need caching

  • joblib.Memory, memoize pattern (see the sketch after this list).
  • Stores very large results on disk, only returning them when you fetch them, and even then only as you iterate over them.
  • You must write functions; the author will never make a context-manager version.
  • How to hash input arguments for the memoize decorator?
    • hashlib.md5, robust, no dependencies.
    • Subclass the pickler, which is a state machine that walks the object graph.
    • If the walk finds something like an ndarray, don't turn it into a string; just pass a pointer. Avoid copies, use memoryviews.
  • When persisting objects, again subclass the pickler and e.g. np.save big NumPy arrays.
  • How to handle locking of persisting results of the cache?
    • Rely on renaming directories being atomic, a basic POSIX operation.
  • Should I compress data to/from disk?
    • Single core: faster uncompressed
    • Multi-core: zlib.compress is faster; again chosen because it has no dependencies.
    • But use it in an online way (sketched after this list).
    • Copyless compression: store meta-data too.
  • Challenges
    • How to stream large results in a cluster.
    • Too many files are slow; opening files is slow on a cluster.
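
  A minimal sketch of the memoize pattern with joblib.Memory: wrap an expensive function, and identical calls are served from the on-disk cache (arguments, ndarrays included, are hashed as described above). The cache directory and the wrapped function are made up for illustration.

    import numpy as np
    from joblib import Memory

    memory = Memory("./joblib_cache", verbose=0)   # cache directory is illustrative

    @memory.cache
    def expensive_transform(x):
        # stand-in for a long computation
        return np.vander(x, 50)

    x = np.random.rand(1000)
    a = expensive_transform(x)   # computed and persisted to disk
    b = expensive_transform(x)   # loaded from the on-disk cache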
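
  And a minimal sketch of compressing in an online way, as mentioned above: feed the array's raw bytes through zlib chunk by chunk with a compressobj, rather than one huge compress() call, so no second full-size copy is needed. The chunk size and compression level are arbitrary.

    import zlib
    import numpy as np

    arr = np.random.rand(1_000_000)
    view = memoryview(arr).cast("B")        # raw bytes of the array, no copy
    comp = zlib.compressobj(level=1)
    pieces = []
    chunk = 1 << 20                         # 1 MiB at a time
    for start in range(0, len(view), chunk):
        pieces.append(comp.compress(view[start:start + chunk]))
    pieces.append(comp.flush())
    compressed = b"".join(pieces)
    print(len(compressed), "bytes, from", arr.nbytes)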

Bigger picture - how to make a sustainable project

  • 200 contributors, ~12 core contributors.
  • The huge feature set is due to this size of team.
  • The Random Forests implementation is getting faster by orders of magnitude because of community contributions.
  1. Focus on quality.
  2. Build great docs and examples.
  • scikit-learn has a very large number of contributors making a large proportional number of commits.
    • Unlike many other Python modules.

Tragedy of the Commons

  • SciPy, NumPy - everyone uses them, but not enough people contribute.
  • These core, vital projects don't get funding.

Heuristics

  1. Set goals correctly. 80/20 rule. Focus on core goals.
  2. Use the simplest technology available. This requires great sophistication.
  3. Don't forget - real humans use your package.

Questions

  • How to encourage contributors?
    • Avoid GUIs with a passion. Don't do it.
    • Just focus on getting users, the rest follows.
    • Don't dumb down the problem.
